Jax: Scanner Generator Reference
This is a summary of the features jax provides, and of those it does not.
These are the methods and variables that are embedded into the Java file
that jax generates from a jax specification.
- int jax_next_token() throws IOException
- This starts the matching process and returns the integer value
returned by the matching action. It returns -1 when EOF is
reached on the input stream.
If an action executes break; rather than returning a value, this
function continues scanning from where it left off instead of
returning.
- void init(InputStream inp) throws IOException
- This is used to prime the lexer, and must be called prior
to calling jax_next_token() or strange things will happen.
- String jax_text()
- This returns a string containing the section of the input
matched by the regular expression.
- void jax_switch_state()
- This is used to switch the scanner into a new state if there are
any %state directives in the file.
- int jax_cur_line
- This is a variable that keeps track of the current line if you
use the %line directive.
- int jax_cur_char
- This is a variable pointing to the current character position in
the input file if you use the %char directive.
In addition, there are a few internal methods that are not intended
to be used except by the lexer itself.
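A typical driving loop might look like the sketch below. Since a generated scanner is not available here, StandInLexer is a hand-written stand-in: only the method names init, jax_next_token and jax_text come from the list above, while the class name, the WORD token code, the instance-method style and the matching logic are assumptions made for illustration. Whitespace handling plays the role of an action that executes break; and therefore returns nothing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for a jax-generated scanner (hypothetical).
// Only the method names are taken from the reference; everything else
// is an assumption for this sketch.
class StandInLexer {
    static final int WORD = 1;          // hypothetical token code
    private InputStream in;
    private int lookahead;
    private final StringBuilder text = new StringBuilder();

    // Primes the lexer; must be called before jax_next_token().
    void init(InputStream inp) throws IOException {
        in = inp;
        lookahead = in.read();
    }

    // Text matched by the most recent token.
    String jax_text() {
        return text.toString();
    }

    // Returns the next token code, or -1 at EOF.
    int jax_next_token() throws IOException {
        while (lookahead != -1) {
            text.setLength(0);
            if (Character.isWhitespace(lookahead)) {
                // Stands in for an action ending in "break;":
                // nothing is returned and scanning continues.
                while (lookahead != -1 && Character.isWhitespace(lookahead)) {
                    lookahead = in.read();
                }
                continue;
            }
            while (lookahead != -1 && !Character.isWhitespace(lookahead)) {
                text.append((char) lookahead);
                lookahead = in.read();
            }
            return WORD;
        }
        return -1;
    }
}

class LexerLoopDemo {
    // Collects "code:text" for every token in the input.
    static List<String> scanAll(String input) throws IOException {
        StandInLexer lex = new StandInLexer();
        lex.init(new ByteArrayInputStream(input.getBytes(StandardCharsets.US_ASCII)));
        List<String> out = new ArrayList<String>();
        int tok;
        while ((tok = lex.jax_next_token()) != -1) {
            out.add(tok + ":" + lex.jax_text());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(scanAll("hello  jax world")); // prints [1:hello, 1:jax, 1:world]
    }
}
```

With a real generated scanner, the caller would replace StandInLexer with the generated class and keep the same init, loop, and jax_text() pattern.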
The driver for jax is in the class sbktech.tools.jax.driver
and calling it without any arguments will print out a synopsis.
java sbktech.tools.jax.driver [-lexFile outputFileName] [-i] inputFileName
Hand it an input file and it will, by default, create a file called
lexer.java; you can override the file name with the -lexFile
option.
The -i option generates a case insensitive scanner.
The case of the matched text is preserved, so
jax_text() still returns the original text.
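The behaviour of -i can be illustrated with the standard java.util.regex package, which is not jax and is used here only as an analogy: the pattern matches regardless of case, but the matched text comes back exactly as it appeared in the input. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy only: java.util.regex, not a jax-generated scanner.
class CaseInsensitiveDemo {
    // Returns the first case-insensitive match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The pattern is lower case, the input is mixed case.
        System.out.println(firstMatch("select", "SeLeCt * from t")); // prints SeLeCt
    }
}
```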
For those who care, and despite everything that is mentioned in the
rest of the document, this is the grammar that jax understands.
st ::=
(VERBATIM) ? ((((lexStatement | stateStatement) | LINE_DIRECTIVE) | CHAR_DIRECTIVE)) + (VERBATIM) ?
lexStatement ::=
PATTERN or_expr PATTERN (VERBATIM) ? SEMI
or_expr ::=
cat_expr (OR cat_expr) *
cat_expr ::=
singleton (singleton) *
singleton ::=
(((DOT | CHAR) | fullccl) | PAREN_OPEN or_expr PAREN_CLOSE) (((STAR | PLUS) | QMARK)) ?
fullccl ::=
SQUARE_OPEN (CARET) ? ccl SQUARE_CLOSE
ccl ::=
(((CHAR DASH CHAR | CHAR) | DOT)) *
stateStatement ::=
STATE (NAME) +
This grammar was generated from the jell specification for jax.
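To make the or_expr part of the grammar concrete, here is a minimal recursive-descent recognizer sketch with one method per rule. It is not jax's own parser. The token spellings are assumptions not confirmed by the grammar itself: OR is '|', DOT is '.', the parentheses and square brackets are literal, STAR, PLUS and QMARK are '*', '+' and '?', CARET is '^', DASH is '-', and CHAR is any other character. Escape sequences are not handled.

```java
// Recognizer for or_expr, following the grammar rules one method per rule.
// Token spellings are assumed; see the lead-in.
class OrExprRecognizer {
    private final String src;
    private int pos = 0;

    OrExprRecognizer(String src) { this.src = src; }

    // True if the whole string is a valid or_expr.
    static boolean accepts(String s) {
        OrExprRecognizer r = new OrExprRecognizer(s);
        return r.orExpr() && r.pos == s.length();
    }

    private boolean atEnd() { return pos >= src.length(); }
    private char peek() { return src.charAt(pos); }
    private boolean isChar(char c) { return "|.()*+?[]^-".indexOf(c) < 0; }

    // or_expr ::= cat_expr (OR cat_expr)*
    private boolean orExpr() {
        if (!catExpr()) return false;
        while (!atEnd() && peek() == '|') {
            pos++;
            if (!catExpr()) return false;
        }
        return true;
    }

    // cat_expr ::= singleton (singleton)*
    private boolean catExpr() {
        if (!singleton()) return false;
        while (!atEnd() && startsSingleton(peek())) {
            if (!singleton()) return false;
        }
        return true;
    }

    private boolean startsSingleton(char c) {
        return c == '.' || c == '[' || c == '(' || isChar(c);
    }

    // singleton ::= (DOT | CHAR | fullccl | PAREN_OPEN or_expr PAREN_CLOSE)
    //               (STAR | PLUS | QMARK)?
    private boolean singleton() {
        if (atEnd()) return false;
        char c = peek();
        if (c == '.' || isChar(c)) {
            pos++;
        } else if (c == '[') {
            if (!fullCcl()) return false;
        } else if (c == '(') {
            pos++;
            if (!orExpr()) return false;
            if (atEnd() || peek() != ')') return false;
            pos++;
        } else {
            return false;
        }
        if (!atEnd() && "*+?".indexOf(peek()) >= 0) pos++;
        return true;
    }

    // fullccl ::= SQUARE_OPEN (CARET)? ccl SQUARE_CLOSE
    // ccl     ::= (CHAR DASH CHAR | CHAR | DOT)*
    private boolean fullCcl() {
        pos++; // consume '['
        if (!atEnd() && peek() == '^') pos++;
        while (!atEnd() && (peek() == '.' || isChar(peek()))) {
            boolean wasChar = isChar(peek());
            pos++;
            // CHAR DASH CHAR range
            if (wasChar && pos + 1 < src.length()
                    && peek() == '-' && isChar(src.charAt(pos + 1))) {
                pos += 2;
            }
        }
        if (atEnd() || peek() != ']') return false;
        pos++;
        return true;
    }
}
```

For example, OrExprRecognizer.accepts("(a|b)*c?") and accepts("[a-z]+") return true, while accepts("a||b") and accepts("(a") return false.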
This is a catalog of the ways I know jax's syntax differs from that of
f/lex. Some of them are intentional, others are there because I'm too lazy
to fix them, and others are bugs.
- Slashes are used to delimit regular expressions. I find this easier
to read than having a combination of whitespace and quotes as lex does.
- Whitespace is not significant in regular expressions. I get confused
embedding multiple blanks, so there is a special escape sequence for blanks.
- Semicolons terminate a regular expression statement. This ought
not be necessary, but it is, and I'm hoping to come
up with a good excuse for it later.
- Jax insists on calling the slash a PATTERN whenever it finds
an error.
- You need to embed actions in a %{ ... %} pair.
- You need not start an action for a regular expression on the
same line as the pattern.
- Macros to expand regular expressions are not provided.
- Context-dependent regular expressions are not provided.
- Jax is not leX. That was almost my first acronym but
weird... that name had a spell of bad luck. Specifically,
the key missing features are macros, translation tables and
context-dependent regular expressions, none of which jax permits.
- Only seven-bit scanners are generated. I think I plan to
leave it this way, since this approach covers a significant
fraction of scanning tasks efficiently. I may use a different
approach (like lazy NFA to DFA conversion) to handle Unicode.
- jax_cur_line returns the line on which the token ends,
rather than where it begins. This is silly, but that fix hasn't
been put into this release.
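Until that is fixed, the starting line of a token can be recovered from the ending line and the matched text. The sketch below assumes that the line counter advances once per '\n' in the matched text; the class and method names are hypothetical, and the two arguments stand for the values of jax_cur_line and jax_text() read right after a match.

```java
// Workaround sketch: derive the line on which a token starts.
class TokenStartLine {
    // endLine: value of jax_cur_line after the match.
    // text:    value of jax_text() for the same match.
    static int startLine(int endLine, String text) {
        int newlines = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                newlines++;
            }
        }
        // Assumes one line increment per '\n' inside the token.
        return endLine - newlines;
    }

    public static void main(String[] args) {
        // A token spanning three lines that ends on line 5 started on line 3.
        System.out.println(startLine(5, "a\nb\nc")); // prints 3
    }
}
```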
On very large scanners, jax can generate code that will trigger a bug
in the 1.02 JDK, which causes the javac compiler to fail with
a UTFDataFormatException on reading a compiled scanner.
Jax will warn if it is about to generate such a scanner, and there
are two (well, three if you count pestering Sun to fix it) solutions.
- Always compile your code by adding the scanner source file to the
compiler arguments.
- Patch the 1.02 JDK in four easy steps (thanks to Luke Gorrie for determining the bug and the fix)
1. Unzip the src.zip which comes with the JDK, and locate the
file java/io/DataInputStream.java
2. Insert a break; above the default: clause in
the case statement around line 323, and recompile this file.
3. Unzip the lib/classes.zip which comes with the JDK,
and replace java/io/DataInputStream.class with the
class file you created in step 2.
4. Zip back classes.zip (no compression!) and put it back in
the lib directory.
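The patch works because a Java case that lacks a break falls through into the clause that follows it. The sketch below is a hypothetical illustration of that language behaviour and is not the JDK source: the same switch is shown without and with the break placed above default:.

```java
// Illustration of switch fall-through; not taken from DataInputStream.java.
class FallThroughDemo {
    // Without a break, control falls from case 1 into default.
    static String withoutBreak(int kind) {
        String result = "unset";
        switch (kind) {
            case 1:
                result = "ok";
                // no break here, so execution continues into default
            default:
                result = "error";
        }
        return result;
    }

    // With the break inserted above default, case 1 is handled on its own.
    static String withBreak(int kind) {
        String result = "unset";
        switch (kind) {
            case 1:
                result = "ok";
                break; // the inserted statement
            default:
                result = "error";
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(withoutBreak(1) + " " + withBreak(1)); // prints error ok
    }
}
```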
This is only what I know of; please do mail me if you are aware of other
additions to this list.
- Jonathan Payne's powerful regexp package to do runtime pattern matching.
- Elliot Berk's JavaLex, which is effectively a rewrite of lex in Java.
KB Sriram
Comments, bug reports: kbs@sbktech.org
Revised: Sat Sep 21 12:59:18 1996
URL: http://www.sbktech.org/jax-ref.html